Fast Approximate Title Matching

Authors

Mirko Brodesser

Hannah Bast

Marjan Celikik

Abstract

Given a set of titles and a query, we want to find the title with the largest similarity to the query. Queries may contain spelling mistakes and words that are not part of the correct entry. Furthermore, the query is not expected to contain word separators, and words of the correct title may be missing. Titles are usually short and consist of multiple words and numbers. We analyse several existing similarity measures, such as edit distance and Jaccard similarity. Besides quality, efficiency is the other requirement for our algorithm; we therefore analyse and compare the best existing algorithms for efficient fuzzy match [CGGM03] and efficient duplicate detection [XWLY08]. We propose a new similarity measure, as well as an algorithm for its efficient computation, to improve on the existing algorithms in terms of quality and efficiency. We show on two different collections that our algorithm is in most cases more efficient than the related work while achieving better quality.
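To make the two baseline measures named in the abstract concrete, here is a minimal sketch of edit distance and of Jaccard similarity over character q-grams. It is not the new measure proposed in the paper, which the abstract does not define. The function names, the choice of q = 3, and the normalisation (lower-casing and stripping spaces, motivated by queries lacking word separators) are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 3) -> set:
    """Set of character q-grams of s. Assumption: case and spaces are
    ignored, since queries are expected to lack word separators."""
    s = s.lower().replace(" ", "")
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: str, b: str, q: int = 3) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def best_title(query: str, titles: list, q: int = 3) -> str:
    """Brute-force baseline: scan every title and keep the most similar."""
    return max(titles, key=lambda t: jaccard(query, t, q))
```

A call such as `best_title("fastaproximatetitlematching", titles)` scans every title, which is exactly the cost that efficient fuzzy-match algorithms aim to avoid; the sketch is only meant to pin down what "largest similarity" means for these two measures.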

Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

A Fast Algorithm for Approximate String Matching on Gene Sequences

Approximate string matching is a fundamental and challenging problem in computer science, for which a fast algorithm is highly demanded in many applications including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It aims at solving a popular variant of the approximate string matching problem, the k-mismatch p...

Fast Least Square Matching

Least square matching (LSM) is one of the most accurate image matching methods in photogrammetry and remote sensing. The main disadvantage of LSM is its high computational complexity, due to the large size of the observation equations. To address this problem, this paper presents a novel method called fast least square matching (FLSM). The main idea of the proposed FLSM is decreasing t...

Fast approximate string matching with finite automata

We present a fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity. The algorithm can be adapted to use a variety of metrics for determining the distance between two words.

SimSem: Fast Approximate String Matching in Relation to Semantic Category Disambiguation

In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest...

A Fast Heuristic for Approximate String Matching 2

We study a fast algorithm for on-line approximate string matching. It is based on a non-deterministic finite automaton, which is simulated using bit-parallelism. If the automaton does not fit in a computer word, we partition the problem into subproblems. We show experimentally that this algorithm is the fastest for typical text search. We also show which algorithms are the best in other cases, and ...
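As an illustration of simulating a non-deterministic automaton with bit-parallelism, the sketch below uses the classic Shift-And scheme with one bit vector per allowed error count (in the style of Wu and Manber). This is a swapped-in textbook technique and not necessarily the formulation used in the paper above; the function name `approx_match_ends` is hypothetical. Python integers are unbounded, so the partitioning needed when the automaton does not fit in a machine word is not shown.

```python
def approx_match_ends(pattern: str, text: str, k: int) -> list[int]:
    """Return the end indices in text where pattern matches a substring
    with at most k edit errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1
    # char_mask[c] has bit i set iff pattern[i] == c.
    char_mask = {}
    for i, ch in enumerate(pattern):
        char_mask[ch] = char_mask.get(ch, 0) | (1 << i)
    # rows[d] bit i set: pattern[:i+1] matches a suffix of the text read
    # so far with at most d errors. Initially d pattern characters can be
    # skipped at the cost of d errors.
    rows = [((1 << d) - 1) & mask for d in range(k + 1)]
    accept = 1 << (m - 1)
    ends = []
    for pos, ch in enumerate(text):
        b = char_mask.get(ch, 0)
        old_prev = rows[0]                      # row d-1 before this step
        rows[0] = ((rows[0] << 1) | 1) & b      # exact Shift-And step
        for d in range(1, k + 1):
            old_cur = rows[d]
            rows[d] = (
                (((old_cur << 1) | 1) & b)      # character matches
                | old_prev                      # extra character in text
                | ((old_prev << 1) | 1)         # substituted character
                | ((rows[d - 1] << 1) | 1)      # pattern character skipped
            ) & mask
            old_prev = old_cur
        if rows[k] & accept:
            ends.append(pos)
    return ends
```

With `k = 0` this reduces to exact search; each extra allowed error adds one more bit vector and a constant number of word operations per text character.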

Publication date: 2010
